
    Recognizing speculative language in research texts

    This thesis studies the use of sequential supervised learning methods on two tasks related to the detection of hedging in scientific articles: hedge cue identification and hedge cue scope detection. Both tasks are addressed using a learning methodology that proposes an iterative, error-based approach to improve classification performance, incorporating expert knowledge into the learning process through knowledge rules. Results are promising: for the first task, we improved baseline results by 2.5 points of F-score by incorporating cue co-occurrence information, while for scope detection, the incorporation of syntactic information and rules for syntactic scope pruning allowed us to improve classification performance from an F-score of 0.712 to 0.835. Compared with state-of-the-art methods, the results are very competitive, suggesting that the approach of improving classifiers based only on the errors committed on a held-out corpus could be successfully applied to other, similar tasks. Additionally, this thesis presents a class schema for representing sentence analysis in a single structure that includes the results of different linguistic analyses. This allows us to better manage the iterative process of classifier improvement, where different attribute sets for learning are used in each iteration. We also propose storing attributes in a relational model, instead of the traditional text-based structures, to facilitate the analysis and manipulation of learning data.
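
Hedge cue identification is typically framed as sequence labeling. As a rough illustration of that framing (not the thesis code; tag names and the example sentence are illustrative), cues can be encoded with BIO tags and recovered as spans:

```python
def bio_spans(tokens, tags):
    """Recover hedge-cue spans from per-token BIO tags (B-CUE / I-CUE / O)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-CUE":
            if start is not None:      # close a cue that ends where a new one begins
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:      # close the current cue
                spans.append((start, i))
                start = None
        # I-CUE simply extends the current span
    if start is not None:              # cue running to the end of the sentence
        spans.append((start, len(tags)))
    return [" ".join(tokens[a:b]) for a, b in spans]

tokens = "these results may suggest a trend".split()
tags = ["O", "O", "B-CUE", "B-CUE", "O", "O"]
print(bio_spans(tokens, tags))  # ['may', 'suggest']
```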

    A Crowd-Annotated Spanish Corpus for Humor Analysis

    Computational Humor involves several tasks, such as humor recognition, humor generation, and humor scoring, for which it is useful to have human-curated data. In this work we present a corpus of 27,000 tweets written in Spanish, crowd-annotated for humor value and funniness score, with about four annotations per tweet, tagged by 1,300 people over the Internet. It is equally divided between tweets coming from humorous and non-humorous accounts. The inter-annotator agreement, measured with Krippendorff's alpha, is 0.5710. The dataset is available for general use and can serve as a basis for humor detection and as a first step to tackle subjectivity. Comment: camera-ready version of the paper submitted to SocialNLP 2018, with a fixed typo.
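
As a sketch of how an agreement figure like the one reported above can be computed, the following is a from-scratch implementation of Krippendorff's alpha for nominal data (the function name and input layout are illustrative; published implementations, such as the `krippendorff` PyPI package, cover more metrics and edge cases):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of category labels per annotated item.
    Items with fewer than two annotations are not pairable and are skipped."""
    o = Counter()  # coincidence matrix: o[(c, k)] over ordered label pairs
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(range(m), 2):  # ordered pairs within the item
            o[(values[a], values[b])] += 1 / (m - 1)
    n = Counter()  # marginal totals per category
    for (c, _k), w in o.items():
        n[c] += w
    total = sum(n.values())
    # observed disagreement: off-diagonal mass of the coincidence matrix
    do = sum(w for (c, k), w in o.items() if c != k) / total
    # expected disagreement under chance pairing of labels
    de = sum(n[c] * n[k] for c in n for k in n if c != k) / (total * (total - 1))
    return 1.0 if de == 0 else 1 - do / de

print(krippendorff_alpha_nominal([["yes", "yes"], ["no", "no"]]))  # 1.0
```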

    Evaluación de modelos de ngrama construidos de derecha a izquierda

    This paper presents an evaluation of several n-gram models in two symmetric scenarios: in the first, the model's n-grams are built by reading the corpus from left to right; in the second, from right to left. In each case, performance is studied using the perplexity measure, considering different options for cut-off, vocabulary reduction, and interpolation with class-based models. The results, although not conclusive, seem to indicate that perplexity values are lower in the second scenario.
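
A minimal sketch of the comparison, using an add-alpha smoothed bigram model (illustrative only; the paper's models, cut-offs, and interpolation schemes are richer). Reversing both token lists before the call yields the right-to-left variant:

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, heldout_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model on held-out text."""
    vocab_size = len(set(train_tokens) | set(heldout_tokens))
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    log_prob, count = 0.0, 0
    for prev, word in zip(heldout_tokens, heldout_tokens[1:]):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log_prob += math.log2(p)
        count += 1
    return 2 ** (-log_prob / count)

train = "a b a b a b".split()
held_out = "a b a b".split()
print(bigram_perplexity(train, held_out))                          # left to right
print(bigram_perplexity(train[::-1], held_out[::-1]))              # right to left
```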

    Uruguay’s COVID-19 contact tracing app reveals the growing importance of data governance frameworks

    Uruguay’s pioneering adoption of Google and Apple’s contact tracing interface is understandable given the urgent need to halt the spread of COVID-19. But this move also puts serious issues of governance, health policy, and human rights in the hands of software developers who have neither the expertise nor the legitimacy required to properly address them. In today’s crisis, just as in the future, governments must ask the right questions about data governance if they are to come up with the right policies, write Fabrizio Scrollini (ILDA), Javier Baliosian (Universidad de la República), Lorena Etcheverry (Universidad de la República), and Guillermo Moncecchi (Universidad de la República).

    Containers in Montevideo: a Multi source Image Dataset

    This work presents Clean-Dirty Containers in Montevideo (CDCM), a novel dataset for detection and classification of residue containers. Images were collected from several sources, including Google Street View, social networks, and smartphone photos. The dataset is publicly available under a Creative Commons license. Sociedad Argentina de Informática e Investigación Operativa.

    Restauración automática de acentos ortográficos en adverbios interrogativos

    The omission of orthographic accents is a very frequent typographical error in Spanish; automatic accent restoration consists of inserting the omitted accents where they are needed. Interrogative adverbs are an especially difficult case of this problem, since in many instances there are no clear markers indicating their presence. This work presents two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), applied to the automatic restoration of orthographic accents for the specific case of interrogative adverbs. Good results were obtained with both techniques, with a noticeably better result from the CRF-based classifier, which uses as features the tokens that most commonly precede and follow interrogative adverbs.
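
The token-window features described above can be sketched as a CRF feature template: a dictionary of features per token, in the style expected by common CRF toolkits. The feature names and the example are illustrative, not the paper's actual feature set:

```python
def token_features(tokens, i):
    """Window features for deciding whether an interrogative adverb
    (e.g. 'como' vs 'cómo') should carry an accent."""
    return {
        "word": tokens[i].lower(),
        # tokens immediately before and after, with sentence-boundary markers
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        # a direct-question context is strong evidence for the accented form
        "in_question": "¿" in tokens,
    }

print(token_features(["¿", "como", "estás", "?"], 1))
```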

    Lavinia: a collaborative NLP platform

    In this article we present Lavinia, a UIMA-based, collaborative web platform for Natural Language Processing, where both NLP software developers and linguistic analysts can test, use, and share different NLP components in a straightforward way. Lavinia allows users to execute UIMA components using a web browser: they can create and configure pipelines of tasks and view their execution results without installing any extra software. We believe that this approach can help people with little computational or programming background get closer to NLP tools, and NLP component developers easily share their work.

    Eventos de supercontagio.

    Superspreading events, those in which a disease is transmitted to a number of people far greater than the average for that disease, pose a significant risk for the management of the COVID-19 pandemic in the coming months. In this note we attempt to characterize superspreading events based on a review of the existing literature, to understand their relevance in the context of COVID-19, and to present some possible actions that, through the control of this kind of event, could be useful for the overall management of the pandemic, especially in the case of Uruguay.

    Détection du langage spéculatif dans la littérature scientifique

    This thesis proposes a methodology for solving certain classification problems, in particular sequential classification tasks in Natural Language Processing. To improve classification results, we propose an iterative, error-based approach that incorporates expert knowledge, represented as "knowledge rules", into the learning process. We applied the methodology to two tasks related to the detection of speculation ("hedging") in scientific literature: the identification of speculative text segments ("hedge cue identification") and the detection of the scope of those segments ("hedge cue scope detection"). The results are promising: for the first task, we improved the baseline F-score by 2.5 points by incorporating data on the co-occurrence of speculative segments. For the second task, incorporating syntactic information and rules for syntactic pruning improved classification results from 0.712 to 0.835 (F-score). Compared with state-of-the-art methods, the results are very good, suggesting that the approach of improving classifiers based only on the errors committed on a corpus can also be applied to other, similar tasks. Moreover, this thesis proposes a class schema for representing the analysis of a sentence in a single structure that integrates the results of different linguistic analyses. This makes it easier to manage the iterative process of classifier improvement, in which different sets of learning attributes are used at each iteration. We also propose storing attributes in a relational model instead of the classical text-based structures, in order to facilitate the analysis and manipulation of learning data.